Shreya Mohan, Kalyani Mantirraju, Crystal
Arevalo,
Karen Alvarez, Mason Lam Group: Statistically Speaking
# install.packages("MASS")
# install.packages("dunn.test")
# install.packages("multcomp")
# install.packages("nortest")
# install.packages("rstatix")
# install.packages("mgcv")
# install.packages("gridExtra")
library(mgcv)
library(rstatix)
library(nycflights13)
library(tidyverse)
library(car)
library(dunn.test)
library(gridExtra)
library(tidyr)
library(broom)
library(multcomp)
library(nortest)
library(boot)
library(knitr)
library(gridExtra)
library(ggcorrplot)
library(MASS)
library(cowplot)
library(knitr)
The primary goal of this research is to explore factors influencing flight delays from New City airports in 2013.
Understanding factors that contribute to flight delays is critical for informing Federal Aviation Administration (FAA) policies and guiding airlines and airports in improving operational efficiency, enhancing weather preparedness, and reducing delays through controllable factors. By analyzing weather conditions, airline differences, holiday effects, fleet age, and airport specific challenges, this research can provide data-driven insights to optimize air travel and ensure compliance with aviation regulations in heavily congested areas like New York City.
##
## Missing values per column:
## tailnum year type manufacturer model engines
## 0 70 0 0 0 0
## seats speed engine
## 0 3299 0
##
## Column types and structure:
## Rows: 3,322
## Columns: 9
## $ tailnum <chr> "N10156", "N102UW", "N103US", "N104UW", "N10575", "N105UW…
## $ year <int> 2004, 1998, 1999, 1999, 2002, 1999, 1999, 1999, 1999, 199…
## $ type <chr> "Fixed wing multi engine", "Fixed wing multi engine", "Fi…
## $ manufacturer <chr> "EMBRAER", "AIRBUS INDUSTRIE", "AIRBUS INDUSTRIE", "AIRBU…
## $ model <chr> "EMB-145XR", "A320-214", "A320-214", "A320-214", "EMB-145…
## $ engines <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ seats <int> 55, 182, 182, 182, 55, 182, 182, 182, 182, 182, 55, 55, 5…
## $ speed <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ engine <chr> "Turbo-fan", "Turbo-fan", "Turbo-fan", "Turbo-fan", "Turb…
## # A tibble: 3,322 × 9
## tailnum year type manufacturer model engines seats speed engine
## <chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr>
## 1 N10156 2004 Fixed wing multi… EMBRAER EMB-… 2 55 NA Turbo…
## 2 N102UW 1998 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
## 3 N103US 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
## 4 N104UW 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
## 5 N10575 2002 Fixed wing multi… EMBRAER EMB-… 2 55 NA Turbo…
## 6 N105UW 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
## 7 N107US 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
## 8 N108UW 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
## 9 N109UW 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
## 10 N110UW 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
## # ℹ 3,312 more rows
##
## First few rows:
## # A tibble: 6 × 9
## tailnum year type manufacturer model engines seats speed engine
## <chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr>
## 1 N10156 2004 Fixed wing multi … EMBRAER EMB-… 2 55 NA Turbo…
## 2 N102UW 1998 Fixed wing multi … AIRBUS INDU… A320… 2 182 NA Turbo…
## 3 N103US 1999 Fixed wing multi … AIRBUS INDU… A320… 2 182 NA Turbo…
## 4 N104UW 1999 Fixed wing multi … AIRBUS INDU… A320… 2 182 NA Turbo…
## 5 N10575 2002 Fixed wing multi … EMBRAER EMB-… 2 55 NA Turbo…
## 6 N105UW 1999 Fixed wing multi … AIRBUS INDU… A320… 2 182 NA Turbo…
planes %>%
count(manufacturer, sort = TRUE) %>%
top_n(10) %>%
ggplot(aes(x = reorder(manufacturer, n), y = n)) +
geom_col(fill = "darkgreen") +
coord_flip() +
labs(title = "Top 10 Plane Manufacturers", x = "Manufacturer", y = "Number of Planes")
## Selecting by n
The visualization above shows the top 10 plane manufacturers present in the data-set. Boeing has the largest amount of planes with approximately 1750 planes, and Airbus has the second most with approximately 400 planes.
ggplot(planes, aes(x = year)) +
geom_histogram(binwidth = 1, fill = "midnightblue", color = "white") +
labs(title = "Distribution of Plane Manufacture Years", x = "Year", y = "Count")
This histogram shows the distribution of plane manufacture years, with the majority of planes built between the mid-1990s and early 2000s. There is a notable peak around the year 2000, indicating a surge in plane production during that period.
Dimensions and column names of the airlines dataset
##
## Missing values per column:
## carrier name
## 0 0
##
## Column types and structure:
## Rows: 16
## Columns: 2
## $ carrier <chr> "9E", "AA", "AS", "B6", "DL", "EV", "F9", "FL", "HA", "MQ", "O…
## $ name <chr> "Endeavor Air Inc.", "American Airlines Inc.", "Alaska Airline…
## # A tibble: 16 × 2
## carrier name
## <chr> <chr>
## 1 9E Endeavor Air Inc.
## 2 AA American Airlines Inc.
## 3 AS Alaska Airlines Inc.
## 4 B6 JetBlue Airways
## 5 DL Delta Air Lines Inc.
## 6 EV ExpressJet Airlines Inc.
## 7 F9 Frontier Airlines Inc.
## 8 FL AirTran Airways Corporation
## 9 HA Hawaiian Airlines Inc.
## 10 MQ Envoy Air
## 11 OO SkyWest Airlines Inc.
## 12 UA United Air Lines Inc.
## 13 US US Airways Inc.
## 14 VX Virgin America
## 15 WN Southwest Airlines Co.
## 16 YV Mesa Airlines Inc.
##
## First few rows:
## # A tibble: 6 × 2
## carrier name
## <chr> <chr>
## 1 9E Endeavor Air Inc.
## 2 AA American Airlines Inc.
## 3 AS Alaska Airlines Inc.
## 4 B6 JetBlue Airways
## 5 DL Delta Air Lines Inc.
## 6 EV ExpressJet Airlines Inc.
Viewing all the Unique Airlines:
airlines %>%
arrange(name)
# Join flights and airline names
flights_airlines <- flights %>%
left_join(airlines, by = "carrier")
# Average delay metrics
avg_delays <- flights_airlines %>%
group_by(name) %>%
summarise(
avg_dep_delay = mean(dep_delay, na.rm = TRUE),
avg_arr_delay = mean(arr_delay, na.rm = TRUE),
flights = n()
)
# Plot: Departure Delay
ggplot(avg_delays, aes(x = reorder(name, avg_dep_delay), y = avg_dep_delay)) +
geom_col(fill = "tomato") +
coord_flip() +
labs(
title = "Average Departure Delay by Airline",
x = "Airline",
y = "Avg Departure Delay (min)"
)
We can see that on average, Frontier Airlines has the most departure delay at around 20 min, with ExpressJet roughly around the same 20 minutes. Less than half the Airlines seem to be past the 13 minute delay mark.
cancel_rate <- flights_airlines %>%
mutate(cancelled = is.na(dep_delay)) %>%
group_by(name) %>%
summarise(cancel_rate = mean(cancelled), total_flights = n())
ggplot(cancel_rate, aes(x = reorder(name, cancel_rate), y = cancel_rate)) +
geom_col(fill = "orange") +
coord_flip() +
labs(
title = "Cancelation Rate by Airline",
x = "Airline",
y = "Cancelation Rate"
)
As we can see from above, Skywest Airlines Inc has the highest cancellation rate, with Mesa Airlines very closely behind, and a huge drop off at Endeavor Air Inc.
##
## Missing values per column:
## year month day dep_time sched_dep_time
## 0 0 0 8255 0
## dep_delay arr_time sched_arr_time arr_delay carrier
## 8255 8713 0 9430 0
## flight tailnum origin dest air_time
## 0 2512 0 0 9430
## distance hour minute time_hour
## 0 0 0 0
##
## Column types and structure:
## Rows: 336,776
## Columns: 19
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
## $ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
## $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
## $ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
## $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
## $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
## $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
## $ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
## $ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
## $ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
## $ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
## $ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…
## # A tibble: 336,776 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
##
## First few rows:
## # A tibble: 6 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
Most of our analysis is based on how other variables and datasets affect and compare to the flights dataset. We are seeing how the arrival time, departure delay time, departure time, arrival delay time, and other variables are affected.
flights that were not canceled - We will be using these the not_canceled data for the rest of the EDA
not_canceled <- filter(flights, !is.na(dep_delay), !is.na(arr_delay))
not_canceled
Basic delay analysis Distribution and Proportion of delayed flights that were not canceled
#histograms
hist(not_canceled$dep_delay, breaks=100, main = "Departure Delays",xlab = "Minutes")
hist(not_canceled$arr_delay, breaks=100, main ="Arrival Delays", xlab = "Minutes")
#proportions
mean(not_canceled$dep_delay>0, na.rm=TRUE)
## [1] 0.3902446
mean(not_canceled$arr_delay>0, na.rm=TRUE)
## [1] 0.4063101
Most of the departure delays do not go over 200 minutes and the arrival delays have very few delays past 200 minutes.
#convert time to hours
not_canceled$dep_hour <-floor(not_canceled$sched_dep_time/100)
not_canceled$arr_hour <-floor(not_canceled$sched_arr_time/100)
#plot
not_canceled |>
group_by(dep_hour) |>
summarize(mean_dep_delay = mean(dep_delay, na.rm=TRUE))|>
ggplot(aes(x=dep_hour, y =mean_dep_delay))+
geom_line()+
geom_point()+
labs(title = "Average Departure Delay by Hour", x="Scheduled Departure", y="Minutes")
We can see that many of the delays happen further in the day and peak at about 18 hours and then it descends from there.
not_canceled |>
group_by(origin)|>
summarize(avg_dep_delay= mean(dep_delay, na.rm=TRUE), avg_arr_delay= mean(arr_delay, na.rm=TRUE))
EWR has the highest average departure and arrival delay followed by JFK and then LGA
not_canceled|>
group_by(carrier)|>
summarize(avg_dep_delay = mean(dep_delay, na.rm=TRUE))|>
arrange(desc(avg_dep_delay))
F9 has the highest average departure delay at 20 hours.
Check to see if the flights that were delayed made up the time in the air
not_canceled|>
mutate(made_up_time = dep_delay- arr_delay)|>
ggplot(aes(x=made_up_time))+
geom_histogram(bins=100)+
labs(title="Made up Time in Air", x= "Minutes Saved", y="Count")
We can see that the majority of the flights did not save any minutes on the arrival delay and actually ended up being delayed more. Some flights did in fact save minutes but it was less than 50% of all flights.
not_canceled |>
group_by(month) |>
summarize(mean_dep_delay = mean(dep_delay, na.rm = TRUE)) |>
ggplot(aes(x = month, y = mean_dep_delay)) +
geom_line() +
geom_point() +
labs(title = "Monthly Departure Delays", x = "Month", y = "Time (minutes)")
we see that the majority of flights are delayed from May to mid July, and there is another peak at December. The months with the shorest delays are September and October.
ggplot(not_canceled, aes(x=distance, y=dep_delay))+
geom_point(alpha=0.2)+
geom_smooth(method = "lm", se=TRUE,color= "Pink")+
labs(title = "Distance vs Departure Delay", x="Distance", y="Departure Delay")
## `geom_smooth()` using formula = 'y ~ x'
We can see that there is not much of an effect of Distance on Departure delay.
longest_Distance <- not_canceled |>
arrange(desc(distance)) |>
dplyr::select(carrier, origin, dest)
longest_Distance
We see that HA is the carrier with the longest flights and they all start at JFK airport and land at HNL.
carrier_Dest <- not_canceled |>
group_by(carrier, dest) |>
summarize(avg_arr_Delay = mean(arr_delay, na.rm = TRUE), .groups = "drop")
ggplot(carrier_Dest, aes(x = reorder(carrier, avg_arr_Delay, median), y = avg_arr_Delay)) +
geom_boxplot() +
coord_flip() +
labs(title = "Average Arrival Delay by Carrier", x = "Carrier", y = "Average Arrival Delay (minutes)") +
theme_minimal()
##
## Missing values per column:
## origin year month day hour temp dewp
## 0 0 0 0 0 1 1
## humid wind_dir wind_speed wind_gust precip pressure visib
## 1 460 4 20778 0 2729 0
## time_hour
## 0
##
## Column types and structure:
## Rows: 26,115
## Columns: 15
## $ origin <chr> "EWR", "EWR", "EWR", "EWR", "EWR", "EWR", "EWR", "EWR", "EW…
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013,…
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ hour <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 16, 17, 18, …
## $ temp <dbl> 39.02, 39.02, 39.02, 39.92, 39.02, 37.94, 39.02, 39.92, 39.…
## $ dewp <dbl> 26.06, 26.96, 28.04, 28.04, 28.04, 28.04, 28.04, 28.04, 28.…
## $ humid <dbl> 59.37, 61.63, 64.43, 62.21, 64.43, 67.21, 64.43, 62.21, 62.…
## $ wind_dir <dbl> 270, 250, 240, 250, 260, 240, 240, 250, 260, 260, 260, 330,…
## $ wind_speed <dbl> 10.35702, 8.05546, 11.50780, 12.65858, 12.65858, 11.50780, …
## $ wind_gust <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 20.…
## $ precip <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ pressure <dbl> 1012.0, 1012.3, 1012.5, 1012.2, 1011.9, 1012.4, 1012.2, 101…
## $ visib <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10,…
## $ time_hour <dttm> 2013-01-01 01:00:00, 2013-01-01 02:00:00, 2013-01-01 03:00…
## # A tibble: 26,115 × 15
## origin year month day hour temp dewp humid wind_dir wind_speed
## <chr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 EWR 2013 1 1 1 39.0 26.1 59.4 270 10.4
## 2 EWR 2013 1 1 2 39.0 27.0 61.6 250 8.06
## 3 EWR 2013 1 1 3 39.0 28.0 64.4 240 11.5
## 4 EWR 2013 1 1 4 39.9 28.0 62.2 250 12.7
## 5 EWR 2013 1 1 5 39.0 28.0 64.4 260 12.7
## 6 EWR 2013 1 1 6 37.9 28.0 67.2 240 11.5
## 7 EWR 2013 1 1 7 39.0 28.0 64.4 240 15.0
## 8 EWR 2013 1 1 8 39.9 28.0 62.2 250 10.4
## 9 EWR 2013 1 1 9 39.9 28.0 62.2 260 15.0
## 10 EWR 2013 1 1 10 41 28.0 59.6 260 13.8
## # ℹ 26,105 more rows
## # ℹ 5 more variables: wind_gust <dbl>, precip <dbl>, pressure <dbl>,
## # visib <dbl>, time_hour <dttm>
##
## First few rows:
## # A tibble: 6 × 15
## origin year month day hour temp dewp humid wind_dir wind_speed wind_gust
## <chr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 EWR 2013 1 1 1 39.0 26.1 59.4 270 10.4 NA
## 2 EWR 2013 1 1 2 39.0 27.0 61.6 250 8.06 NA
## 3 EWR 2013 1 1 3 39.0 28.0 64.4 240 11.5 NA
## 4 EWR 2013 1 1 4 39.9 28.0 62.2 250 12.7 NA
## 5 EWR 2013 1 1 5 39.0 28.0 64.4 260 12.7 NA
## 6 EWR 2013 1 1 6 37.9 28.0 67.2 240 11.5 NA
## # ℹ 4 more variables: precip <dbl>, pressure <dbl>, visib <dbl>,
## # time_hour <dttm>
Related Questions from our Proposal: 1) How do weather conditions affect flight delays? 2) How do environmental factors like humidity, visibility, and wind affect flight delays? 3) What impact does precipitation have on specific airports and weather-related delays?
# Average Temperature by Month
monthlyavgtemp <- weather %>%
group_by(month) %>%
summarise(monthlyavgtemp = mean(temp, na.rm = TRUE))
ggplot(data = monthlyavgtemp,
aes(x = factor(month), y = monthlyavgtemp)) +
geom_col(color = "skyblue", fill = "lightskyblue1") +
labs(title = "Average Temperature by Month",
x = "Month",
y = "Average Temperature (Fº)") +
theme_minimal()
We can see that the average temperature ranges from about 70-80º in the summer, and 35-40º in the winter. When answering our research questions, we can see if there is a correlation between summer/winter weather and flight delays.
# Average Visibility by Month
monthlyavgvisib <- weather %>%
group_by(month) %>%
summarise(monthlyavgvisib = mean(visib, na.rm = TRUE))
ggplot(data = monthlyavgvisib) +
geom_col(aes(x = factor(month), y = monthlyavgvisib),
color = "orchid4", fill = "plum") +
labs(title = "Average Visibility by Month",
x = "Month",
y = "Average Visibility") +
theme_minimal()
ggplot(weather, aes(x = factor(month), y = visib)) +
geom_boxplot(fill = "plum", color = "orchid4") +
labs(title = "Distribution of Visibility by Month",
x = "Month",
y = "Visibility") +
coord_flip() +
theme_minimal()
ggplot(filter(weather, visib < 10), aes(x = factor(month), y = visib)) +
geom_boxplot(fill = "plum", color = "orchid4") +
labs(title = "Distribution of Visibility < 10 by Month",
x = "Month",
y = "Visibility") +
coord_flip() +
theme_minimal()
The average visibility does not greatly vary by month looking at the average value. However, we can see that there is slightly less visiblity in winter months. Looking at the boxplots, we can see that there are a lot of outliers. Removing these outliers and focusing on visib < 10 shows us a better distribution of visiblity. When answering our research questions, we can compare the average visibility during flight delays vs average visibility without flight delays to further explore the role of visibility in flight delays.
# Distribution of Humidity by Month
ggplot(weather, aes(x = factor(month), y = humid)) +
geom_boxplot(fill = "darkseagreen2", color = "darkgreen") +
labs(title = "Distribution of Humidity by Month (With Outliers)",
x = "Month",
y = "Humidity") +
coord_flip() +
theme_minimal()
The distribution of humidity varies by month, but there does not seem to be significant differences. We can further explore the role of humidity by comparing it to other weather variables and flight delays.
# Precipitation by Month
monthlyprecip <- weather %>%
group_by(month) %>%
summarise(avgmonthlyprecip = mean(precip, na.rm = TRUE))
ggplot(monthlyprecip, aes(x = month, y = avgmonthlyprecip)) +
geom_line(color = "palevioletred2") +
geom_point(color = "palevioletred4") +
labs(title = "Average Monthly Precipitation",
x = "Month",
y = "Average Precipitation") +
scale_x_continuous(breaks = 1:12) +
theme_minimal()
The average precipitation for each month varies greatly. We can see that spring months have the greatest average precipitation. When answering our research question, we can see if greater precipitation correlates to flight delays.
# Correlation Between Variables
cor(weather$precip, weather$visib, use = "complete.obs")
## [1] -0.3199118
cor(weather$humid, weather$visib, use = "complete.obs")
## [1] -0.5167424
We can explore the correlation between different weather variables and see how they may work together to impact flight delays.
cat("\nMissing values per column:\n")
##
## Missing values per column:
print(colSums(is.na(airports)))
## faa name lat lon alt tz dst tzone
## 0 0 0 0 0 0 0 3
cat("\nColumn types and structure:\n")
##
## Column types and structure:
print(glimpse(airports))
## Rows: 1,458
## Columns: 8
## $ faa <chr> "04G", "06A", "06C", "06N", "09J", "0A9", "0G6", "0G7", "0P2", "…
## $ name <chr> "Lansdowne Airport", "Moton Field Municipal Airport", "Schaumbur…
## $ lat <dbl> 41.13047, 32.46057, 41.98934, 41.43191, 31.07447, 36.37122, 41.4…
## $ lon <dbl> -80.61958, -85.68003, -88.10124, -74.39156, -81.42778, -82.17342…
## $ alt <dbl> 1044, 264, 801, 523, 11, 1593, 730, 492, 1000, 108, 409, 875, 10…
## $ tz <dbl> -5, -6, -6, -5, -5, -5, -5, -5, -5, -8, -5, -6, -5, -5, -5, -5, …
## $ dst <chr> "A", "A", "A", "A", "A", "A", "A", "A", "U", "A", "A", "U", "A",…
## $ tzone <chr> "America/New_York", "America/Chicago", "America/Chicago", "Ameri…
## # A tibble: 1,458 × 8
## faa name lat lon alt tz dst tzone
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A America/…
## 2 06A Moton Field Municipal Airport 32.5 -85.7 264 -6 A America/…
## 3 06C Schaumburg Regional 42.0 -88.1 801 -6 A America/…
## 4 06N Randall Airport 41.4 -74.4 523 -5 A America/…
## 5 09J Jekyll Island Airport 31.1 -81.4 11 -5 A America/…
## 6 0A9 Elizabethton Municipal Airport 36.4 -82.2 1593 -5 A America/…
## 7 0G6 Williams County Airport 41.5 -84.5 730 -5 A America/…
## 8 0G7 Finger Lakes Regional Airport 42.9 -76.8 492 -5 A America/…
## 9 0P2 Shoestring Aviation Airfield 39.8 -76.6 1000 -5 U America/…
## 10 0S9 Jefferson County Intl 48.1 -123. 108 -8 A America/…
## # ℹ 1,448 more rows
cat("\nFirst few rows:\n")
##
## First few rows:
print(head(airports))
## # A tibble: 6 × 8
## faa name lat lon alt tz dst tzone
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A America/Ne…
## 2 06A Moton Field Municipal Airport 32.5 -85.7 264 -6 A America/Ch…
## 3 06C Schaumburg Regional 42.0 -88.1 801 -6 A America/Ch…
## 4 06N Randall Airport 41.4 -74.4 523 -5 A America/Ne…
## 5 09J Jekyll Island Airport 31.1 -81.4 11 -5 A America/Ne…
## 6 0A9 Elizabethton Municipal Airport 36.4 -82.2 1593 -5 A America/Ne…
# Looking at the variation of flight delays between airports
flights_mod <- flights |>
mutate(month = factor(month, levels = 1:12, labels = month.abb))
delays_airport <- flights_mod |>
group_by(origin) |>
summarise(mean_delay = mean(dep_delay, na.rm = TRUE),
median_delay = median(dep_delay, na.rm = TRUE),
sd_delay = sd(dep_delay, na.rm = TRUE))
airport_avg_delay <- ggplot(delays_airport, aes(x = origin, y = mean_delay, fill = origin)) +
geom_bar(stat = "identity", show.legend = FALSE) +
labs(title = "Average Flight Delays by Airport", x = "Airport", y = "Average Delay (minutes)")
# Looking at variation of cancelled flights by airports over a period of 12 months
cancelled_flights <- flights_mod |>
filter(!is.na(month)) |>
group_by(origin, month) |>
summarise(cancelled = sum(is.na(dep_time)))
## `summarise()` has grouped output by 'origin'. You can override using the
## `.groups` argument.
airport_cancelled_month <- ggplot(cancelled_flights, aes(x = month, y = cancelled, fill = origin)) +
geom_bar(stat = "identity", position = "dodge", show.legend = FALSE) +
facet_wrap(~origin) +
coord_flip() +
labs(title = "Cancelled flights by Airport and by Month", x = "Month", y = "Number of Cancelled Flights")
# Looking at the flight patterns for each airport over a period of 12 months
flight_patterns <- flights_mod |>
group_by(origin, month) |>
summarise(avg_flights = n() / length(unique(day)), .groups = "drop")
airport_avg_flight_month <- ggplot(flight_patterns, aes(x = month, y = avg_flights, color = origin, group = origin)) +
geom_line() +
labs(title = "Average Flight Patterns from Airport by Month", x = "Month", y = "Average Number of Flights")
# Looking at carriers departing from airports over a period of 12 months
flights_with_airlines <- flights_mod |>
left_join(airlines, by = "carrier")
carrier_variation <- flights_with_airlines |>
group_by(origin, name, month)
airport_carriers_month <- ggplot(carrier_variation, aes(x = name, y = month, fill = name)) +
geom_bar(stat = "identity", position = "dodge", show.legend = FALSE) +
facet_wrap(~origin) +
coord_flip() +
labs(title = "Variation of Carriers Departing from Airports", x = "Carrier", y = "Number of Flights")
# Plots
airport_avg_delay
airport_cancelled_month
airport_avg_flight_month
airport_carriers_month
1: When looking at the Average Flight Delays we can see that Newark Liberty Airport (EWR) has the highest flight delays averaging at 15 minutes, followed by John F. Kennedy Airport (JFK) and LaGuardia Airport (LGA). Based on this information, further investigation into potential causes like weather and airline operations could provide more information concerning the delays at EWR.
When looking at cancelled flights by airport and month, distinct patterns emerge for EWR, JFK, and LGA. EWR experiences a higher frequency of cancellations throughout the year, with notable peaks in February, March, and December. This may be attributed to winter weather and holiday congestion. JFK generally has fewer cancellations, expect for a significant spike in February, this may also be weather-related as this pattern is seen in EWR. LGA shows variability in cancellations, with higher instances during the summer months and December, possibly influenced by seasonal weather and holiday travel. These patterns suggest that weather conditions and holiday would be worth investigating to better understand their effects on flight delays.
The line plot illustrating average flight patterns by month reveals distinct seasonal trends for each airport. EWR exhibits a significant peak in April, averaging 350 flights, followed by a gradual decrease for the remainder of the year. JFK displays more fluctuation throughout the year, with its highest average number of flights, just under 325, occurring during the summer months, followed by a decrease and a slight uptick during the holiday season. LGA, while experiencing the lowest overall flight numbers, shows a gradual incrase with a peak in October, followed by a decline. These patterns may suggest that EWR and JFK, as international airports, are more influenced by seasonal travel trends, whereas LGA, with its domestic focus, shows a more consistent traffic flow.
The bar chart shows the distribution of flights by carrier for each airport. EWR and JFK have a variety of carriers, while LGA shows a more concentrated distribution with fewer carriers. The variety of carriers at EWR and JFK may contribute to increased operational complexity and may potentially lead to higher delays. While LGA’s more concentrated carriers might allow for efficient operations, however this isn’t the case as LGA experienced a higher number of cancelled flights. This would be interesting to investigate is it implies that factors other than weather, such as operational or logistical challenges, could be influencing flight delays and cancellations.
Assumptions: All variables are independent
The process of analysis will involve data cleaning after forming our question, basic exploration of the data, comparison of certain datasets with other datasets, visualization of the data, and an interpretation of the data/results. Cleaning of the data will deal with tasks like handling empty cells/columns and NA values. When it comes to exploratory data analysis, we plan on using tools such as histograms and boxplots to gain an understanding of the data and identify patterns and relationships. The statistical analysis that we plan on performing with the data will most likely involve making comparisons between groups to compare airlines, times, and other metrics to make our overall claim. For example, we might be comparing trends in time performance by weeks or month between different airlines to gain a better understanding of how differences in airlines affect delays. In terms of data visualization, we will most likely be using line graphs for trends over time when it comes to comparing flight time under different variables and heatmaps/scatterplots for flight delays to help communicate our findings. Finally, interpretation of the data will involve us answering the proposed question by summarizing our statistics/findings as well as through the presentation of graphical evidence.
To answer this question, we can explore factors such as airlines, engines,
\(H_0\): All airlines have the same average delay.
\(H_A\): All airlines do not have the same average delay.
We can test this by comparing the means of the different airlines.
flights_not_missing = flights %>%
filter(!is.na(arr_delay))
flights_not_missing = flights_not_missing %>%
left_join(airlines, by = "carrier")
top_airlines = flights_not_missing %>%
count(name, sort = TRUE) %>%
top_n(5) %>%
pull(name)
## Selecting by n
flights_subset = flights_not_missing %>%
filter(name %in% top_airlines)
# Testing assumptions before ANOVA
print("Levene Test for Equal Variances: ")
## [1] "Levene Test for Equal Variances: "
leveneTest(arr_delay ~ name, data = flights_subset)
# Cannot use Shapiro-Wilk to test for normality due to large sample
# using ad test instead
ad.test(flights_subset$arr_delay)
##
## Anderson-Darling normality test
##
## data: flights_subset$arr_delay
## A = 17700, p-value < 2.2e-16
length(flights_subset$arr_delay)
## [1] 242544
Before attempting to test our hypotheses with ANOVA, we check on the assumptions of equal variances and normality. Both are violated, but we can bypass the normality violation because of the large sample size. This means we can use a Welch Anova test instead of the regular Anova, which assumes that the variances are not equal.
# Welch Anova Test
print("Welch Anova Test")
## [1] "Welch Anova Test"
oneway.test(arr_delay ~ name, data = flights_subset, var.equal = FALSE)
##
## One-way analysis of means (not assuming equal variances)
##
## data: arr_delay and name
## F = 887.09, num df = 4, denom df = 113281, p-value < 2.2e-16
# Games_howell in place of Tukey
flights_subset %>%
games_howell_test(arr_delay ~ name)
The F statistic is extremely high, suggesting there is definitely a difference between means. We can reject our null. Since we are using Welch Anova, we can use the Games-Howell test in place of the Tukey test to observe the differences between each airline.
The results from the Games-Howell test show us that there is a stark difference between the means of each airline (0 does not exist within any confidence interval and the p-values are very low). So to answer our question: yes, some airlines have more delays than others.
\(H_0\): Arrival delay does not vary between different engine types.
\(H_A\): Arrival delay does vary between different engine types.
We can use ANOVA to compare different engine types. However, we first need to test the assumptions.
flights_engines = flights %>%
filter(!is.na(arr_delay)) %>%
left_join(planes, by = "tailnum") %>%
filter(!is.na(engine))
table(flights_engines$engine)
##
## 4 Cycle Reciprocating Turbo-fan Turbo-jet Turbo-prop
## 47 1703 236084 40736 46
## Turbo-shaft
## 401
# Testing assumptions before ANOVA
print("Levene Test for Equal Variances: ")
## [1] "Levene Test for Equal Variances: "
leveneTest(arr_delay ~ engine, data = flights_engines)
# Cannot use Shapiro-Wilk to test for normality due to large sample
# using ad test instead
ad.test(flights_engines$arr_delay)
##
## Anderson-Darling normality test
##
## data: flights_engines$arr_delay
## A = 20772, p-value < 2.2e-16
By testing the assumptions of equal variances and normality, we can see that both assumptions are violated. However, due to the large sample size, we can bypass the normality assumption. We can therefore use the Welch ANOVA test to account for the unequal variances.
oneway.test(arr_delay ~ engine, data = flights_engines, var.equal = FALSE)
##
## One-way analysis of means (not assuming equal variances)
##
## data: arr_delay and engine
## F = 77.48, num df = 5.00, denom df = 248.04, p-value < 2.2e-16
flights_engines %>%
games_howell_test(arr_delay ~ engine)
The results from the Games-Howell test show us that most of the different engine types do not have a different mean arrival delay. However, there is a difference between turbo-fan vs turbo-jet engines. This suggests that for most engine types, arrival delay does not vary significantly between engines. We can check the most common type of engine for each airlines.
flights_engines_named = flights_engines %>%
left_join(airlines, by = "carrier")
# Find most common engine type per airline
most_common_engines = flights_engines_named %>%
group_by(name, engine) %>%
summarise(count = n(), .groups = "drop") %>%
arrange(name, desc(count)) %>%
group_by(name) %>%
slice(1)
most_common_engines
It looks like all the airlines mostly use Turbo-fan engines, which means we don’t have much evidence to connect different engine types with the arrival delays of different airlines. We fail to reject the null hypothesis.
\(H_0\): Cancellation rate is the same across all airlines.
\(H_A\): Cancellation rate differs between at least some airlines.
We can use a chi-squared test to check our hypotheses.
flights_cancel = flights %>%
mutate(cancelled = is.na(dep_time)) %>%
left_join(airlines, by = "carrier")
cancel_table = table(flights_cancel$name, flights_cancel$cancelled)
cancel_table
##
## FALSE TRUE
## AirTran Airways Corporation 3187 73
## Alaska Airlines Inc. 712 2
## American Airlines Inc. 32093 636
## Delta Air Lines Inc. 47761 349
## Endeavor Air Inc. 17416 1044
## Envoy Air 25163 1234
## ExpressJet Airlines Inc. 51356 2817
## Frontier Airlines Inc. 682 3
## Hawaiian Airlines Inc. 342 0
## JetBlue Airways 54169 466
## Mesa Airlines Inc. 545 56
## SkyWest Airlines Inc. 29 3
## Southwest Airlines Co. 12083 192
## United Air Lines Inc. 57979 686
## US Airways Inc. 19873 663
## Virgin America 5131 31
chisq_test_result = chisq.test(cancel_table)
## Warning in chisq.test(cancel_table): Chi-squared approximation may be incorrect
chisq_test_result
##
## Pearson's Chi-squared test
##
## data: cancel_table
## X-squared = 4997.8, df = 15, p-value < 2.2e-16
Since the p-value is low, we can say that there is evidence that cancellation rate does vary across different airlines and we can reject the null hypothesis.
\(H_0\): Mean speed is the same across all airlines.
\(H_A\): At least one airline has a different mean speed from the others.
We can test this using ANOVA, first checking assumptions of normality and equal variances.
flights_speed = flights %>%
filter(!is.na(air_time), air_time > 0, !is.na(distance)) %>%
mutate(speed = distance / (air_time / 60)) %>%
left_join(airlines, by = "carrier")
# Testing equal variances with Levene's
leveneTest(speed ~ name, data = flights_speed)
# Testing for normality with ad test
ad.test(flights_speed$speed)
##
## Anderson-Darling normality test
##
## data: flights_speed$speed
## A = 2978.2, p-value < 2.2e-16
Like the previous questions, we can bypass normality and use a Welch ANOVA test for unequal variances.
oneway.test(speed ~ name, data = flights_speed)
##
## One-way analysis of means (not assuming equal variances)
##
## data: speed and name
## F = 7733.1, num df = 15.0, denom df = 1931.8, p-value < 2.2e-16
flights_speed %>%
games_howell_test(speed ~ name)
The results show us that there is a drastic evidence to show a difference between speed for every airline. All the p-values are below 0.05, so we can reject the null hypothesis and say that speed does vary across different airlines. We can further see how speed interacts with flight delays by testing the correlation.
First, we can see how speed affects delays by each airline.
speed_delay_summary = flights_speed %>%
filter(!is.na(arr_delay)) %>%
group_by(name) %>%
summarise(
avg_speed = mean(speed, na.rm = TRUE),
avg_arr_delay = mean(arr_delay, na.rm = TRUE)
)
speed_delay_summary
cor.test(speed_delay_summary$avg_speed, speed_delay_summary$avg_arr_delay)
##
## Pearson's product-moment correlation
##
## data: speed_delay_summary$avg_speed and speed_delay_summary$avg_arr_delay
## t = -2.3331, df = 14, p-value = 0.03507
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.81187916 -0.04529422
## sample estimates:
## cor
## -0.5291194
ggplot(speed_delay_summary, aes(x = avg_speed, y = avg_arr_delay, label = name)) +
geom_point() +
geom_text(nudge_y = 2, size = 3) +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "Average Speed vs. Arrival Delay by Airline",
x = "Average Speed (mph)",
y = "Average Arrival Delay (min)")
## `geom_smooth()` using formula = 'y ~ x'
This correlation confidence interval does not include 0 but it does come close, and it shows us that the correlation between speed and delay by airline is about -0.529. This suggests that airlines with more speed have less delay. Note that there are some outliers, since we only have a few different airlines to observe.
To provide some insight into our overall findings, we found that: (1) Different airlines have different delays on average, (2) Engine type does not seem to vary across airlines, and it doesn’t seem to have a significant impact on delays, (3) Average cancellation rates vary between airlines, and (4) Airlines with faster mean speeds tend to experience lower arrival delays on average.
\(H_0\): There is no difference in the distribution of arrival delays during major holidays compared to other days.
\(H_1\): Arrival delays are different during major holidays compared to other days.
#identifying holiday periods
holidayDates<- as.Date(c(
"2013-11-27", #day before thanksgiving
"2013-11-28", #thanksgiving 2013
"2013-12-24", #day before christmas
"2013-12-25", #christmas 2013
"2014-12-31", #new years eve
"2014-01-01" #new years 2014
))
flights_holiday<-flights|>
mutate(holiday_flag=ifelse(as.Date(time_hour) %in% holidayDates, "Holiday Period", "Non-Holiday"))|>
filter(!is.na(arr_delay))
table(flights_holiday$holiday_flag)
##
## Holiday Period Non-Holiday
## 3330 324016
Departure Delays
ggplot(flights_holiday, aes(x=holiday_flag, y=dep_delay, fill= holiday_flag))+
geom_boxplot(outlier.size = 0.5, alpha=0.7)+
labs(title= "Departure Delay Distributions: Holiday Period vs. NonHoliday Flights",
x="", y="Departure Delay (minutes)")+
theme(legend.position = "none")
set.seed(707)
normality_test<-flights_holiday|> #normality test per group
group_by(holiday_flag)|>
summarise(sample_delays= list(sample(dep_delay, min(5000,n()), replace = FALSE)),
shapiro_p = shapiro.test(unlist(sample_delays))$p.value)|>
ungroup()|>
dplyr::select(holiday_flag, shapiro_p)
print(normality_test)
## # A tibble: 2 × 2
## holiday_flag shapiro_p
## <chr> <dbl>
## 1 Holiday Period 2.10e-67
## 2 Non-Holiday 3.38e-78
flights_holiday$holiday_flag<-as.factor(flights_holiday$holiday_flag)
leveneTest(dep_delay~holiday_flag, data=flights_holiday)#levenes test for homogeneity of variance
holiday_test<-wilcox.test(dep_delay~holiday_flag, data=flights_holiday)#Wilcoxon Rank Sum Test
print(holiday_test)
##
## Wilcoxon rank sum test with continuity correction
##
## data: dep_delay by holiday_flag
## W = 583916914, p-value = 2.42e-16
## alternative hypothesis: true location shift is not equal to 0
cat("Wilcoxon test p-value: ", signif(holiday_test$p.value,4))
## Wilcoxon test p-value: 2.42e-16
Departure Conclusion: Departure delays has a low p-value for both holiday and non holiday periods which means that the distributions are not normal in either group, this leads us to use a non-parametric test such as the Levene test and Wilcoxon rank sum test. For departure delays, the Levene’s test p=0.4355 which is not significant meaning that variances are roughly equal between holiday and non holiday groups for departure delays. The Wilcoxon Rank Sum test p-value is approximately \(7.33 \times 10^{-10}\) which is significant and allows us to conclude that departure delays are significant between holiday and non holiday periods.
#Arrival Delays
ggplot(flights_holiday, aes(x=holiday_flag, y=arr_delay, fill= holiday_flag))+
geom_boxplot(outlier.size = 0.5, alpha=0.7)+
labs(title= "Arrival Delay Distributions: Holiday Period vs. NonHoliday Flights",
x="", y="Arrival Delay (minutes)")+
theme(legend.position = "none")
set.seed(707)
normality_test<-flights_holiday|> #normality test per group
group_by(holiday_flag)|>
summarise(sample_delays= list(sample(arr_delay, min(5000,n()), replace = FALSE)),
shapiro_p = shapiro.test(unlist(sample_delays))$p.value)|>
ungroup()|>
dplyr::select(holiday_flag, shapiro_p)
print(normality_test)
## # A tibble: 2 × 2
## holiday_flag shapiro_p
## <fct> <dbl>
## 1 Holiday Period 9.49e-57
## 2 Non-Holiday 1.14e-67
flights_holiday$holiday_flag<-as.factor(flights_holiday$holiday_flag)
leveneTest(arr_delay~holiday_flag, data=flights_holiday)#levenes test for homogeneity of variance
holiday_test<-wilcox.test(arr_delay~holiday_flag, data=flights_holiday)#Wilcoxon Rank Sum Test
print(holiday_test)
##
## Wilcoxon rank sum test with continuity correction
##
## data: arr_delay by holiday_flag
## W = 572896209, p-value = 7.329e-10
## alternative hypothesis: true location shift is not equal to 0
cat("Wilcoxon test p-value: ", signif(holiday_test$p.value,4))
## Wilcoxon test p-value: 7.329e-10
Arrival Conclusion: Arrival delays has a low p-value for both holiday and non holiday periods which means that the distributions are not normal in either group, this leads us to use a non-parametric test such as the Levene test and Wilcoxon rank sum test. For arrival delays, the Levene’s test p=0.04781 which indicates that the variance differs significantly between holiday and non holiday groups for arrival delays. The Wilcoxon Rank Sum test p-value is ~7.33e-10 which is significant and allows us to conclude that departure delays are significant between holiday and non holiday periods.
Overall Conclusion: We can conclude that Departure and Arrival flight delays occur more frequently or are more severe during major holiday periods, defined as the day before and day of the holiday. The analysis shows that arrival delays increase and exhibit significantly greater variability during holiday times, indicating inconsistent and unpredictable arrival times around major holidays. Although departure delays show similar mean increases during holidays, the variance remains stable suggesting that delays at departure are consistently worse but not erratic. This analysis highlights the operational challenges airlines and airports face during major holidays, emphasizing the need for planning and resource allocations to mitigate such delays during holiday periods.
Within this main question we will perform hypothesis tests to answer the two following sub-questions:
\(H_0\): There is no difference in the distribution of arrival delays across the different plane age groups.
\(H_1\): At least one plane age group has a different distribution of arrival delays compared to the others.
# Join flights with planes to get plane manufacture year
planes_fixed <- planes %>%
rename(plane_year = year)
flights_planes <- flights %>%
inner_join(planes %>% rename(plane_year = year), by = "tailnum") %>%
filter(!is.na(plane_year), !is.na(arr_delay)) %>%
mutate(
plane_age = 2013 - plane_year,
age_group = case_when(
plane_age < 10 ~ "0-9 yrs",
plane_age < 20 ~ "10-19 yrs",
plane_age < 30 ~ "20-29 yrs",
TRUE ~ "30+ yrs"
)
)
head(flights_planes)
The boxplot shows that arrival delays are fairly similar across all plane age groups, with comparable medians and interquartile ranges. While each group has extreme outliers, there is no clear visual trend suggesting that older planes experience more delays than newer ones. We will run a hypothesis test to further our findings.
flights_planes$age_group <- as.factor(flights_planes$age_group)
summary_stats <- flights_planes %>%
group_by(age_group) %>%
summarise(
mean_delay = mean(arr_delay, na.rm = TRUE),
count = n()
)
print(summary_stats)
## # A tibble: 4 × 3
## age_group mean_delay count
## <fct> <dbl> <int>
## 1 0-9 yrs 7.36 103366
## 2 10-19 yrs 7.61 133479
## 3 20-29 yrs 4.00 35412
## 4 30+ yrs 5.54 1596
# Normality test with sample per group
set.seed(123)
normality_test <- flights_planes %>%
group_by(age_group) %>%
summarise(
sample_delays = list(sample(arr_delay[!is.na(arr_delay)], min(5000, n()), replace = FALSE)),
shapiro_p = shapiro.test(unlist(sample_delays))$p.value
) %>%
dplyr::select(-sample_delays)
print(normality_test)
## # A tibble: 4 × 2
## age_group shapiro_p
## <fct> <dbl>
## 1 0-9 yrs 5.81e-68
## 2 10-19 yrs 1.49e-69
## 3 20-29 yrs 4.87e-75
## 4 30+ yrs 2.11e-45
# Levene's Test for homogeneity of variances
leveneTest(arr_delay ~ age_group, data = flights_planes)
# Since homogeneity is violated, run Kruskal-Wallis test (non-parametric)
kruskal.test(arr_delay ~ age_group, data = flights_planes)
##
## Kruskal-Wallis rank sum test
##
## data: arr_delay by age_group
## Kruskal-Wallis chi-squared = 514.57, df = 3, p-value < 2.2e-16
# post-hoc test for Kruskal-Wallis (Dunn test) if significant
dunn.test(flights_planes$arr_delay, flights_planes$age_group, method = "bonferroni")
## Kruskal-Wallis rank sum test
##
## data: x and group
## Kruskal-Wallis chi-squared = 514.566, df = 3, p-value = 0
##
##
## Comparison of x by group
## (Bonferroni)
## Col Mean-|
## Row Mean | 0-9 yrs 10-19 yr 20-29 yr
## ---------+---------------------------------
## 10-19 yr | -2.397838
## | 0.0495
## |
## 20-29 yr | 19.95894 22.22150
## | 0.0000* 0.0000*
## |
## 30+ yrs | 0.872657 1.268699 -3.942415
## | 1.0000 0.6136 0.0002*
##
## alpha = 0.05
## Reject Ho if p <= alpha/2
The Kruskal-Wallis test revealed a highly significant difference in arrival delays across plane age groups (p < 2.2e-16), indicating that at least one group’s delay distribution differs from the others. Post-hoc pairwise comparisons using Dunn’s test with Bonferroni correction showed that planes aged 20–29 years experience significantly different delay patterns compared to both the 0–9 and 10–19 year groups (adjusted p-values < 0.001). Additionally, planes aged 30+ years differ significantly from the 20–29 year group (adjusted p = 0.0002), but do not differ significantly from the younger 0–9 or 10–19 year groups. The difference between the 10–19 and 0–9 year groups was borderline significant (adjusted p = 0.0495). In summary, planes aged 20–29 years tend to have notably different arrival delays compared to most other age groups, highlighting a possible link between this age range and on-time performance issues.
\(H_0\): There is no difference in the distribution of arrival delays across different plane models or manufacturers.
\(H_1\): At least one plane model or manufacturer has a different distribution of arrival delays compared to the others.
flights_manufacturer <- flights %>%
inner_join(planes, by = "tailnum") %>%
filter(!is.na(manufacturer), !is.na(arr_delay)) %>%
group_by(manufacturer) %>%
filter(n() > 50)
# filter manufacturers with more than 50 flights
flights_manufacturer <- flights %>%
inner_join(planes, by = "tailnum") %>%
filter(!is.na(manufacturer), !is.na(arr_delay)) %>%
group_by(manufacturer) %>%
filter(n() > 50) %>%
ungroup()
flights_manufacturer$manufacturer <- as.factor(flights_manufacturer$manufacturer)
# Normality test per manufacturer group
set.seed(123)
normality_test <- flights_manufacturer %>%
group_by(manufacturer) %>%
summarise(
sample_delays = list(sample(arr_delay, min(5000, n()), replace = FALSE)),
shapiro_p = shapiro.test(unlist(sample_delays))$p.value
) %>%
dplyr::select(-sample_delays)
print(normality_test)
## # A tibble: 22 × 2
## manufacturer shapiro_p
## <fct> <dbl>
## 1 AIRBUS 1.03e-66
## 2 AIRBUS INDUSTRIE 1.17e-71
## 3 BARKER JACK L 1.20e-15
## 4 BELL 2.94e-11
## 5 BOEING 4.68e-67
## 6 BOMBARDIER INC 8.57e-69
## 7 CANADAIR 2.56e-47
## 8 CANADAIR LTD 1.16e-11
## 9 CESSNA 1.67e-30
## 10 CIRRUS DESIGN CORP 1.70e-22
## # ℹ 12 more rows
# Levene's Test for homogeneity of variance
leveneTest(arr_delay ~ manufacturer, data = flights_manufacturer)
# Kruskal-Wallis test
kruskal_result <- kruskal.test(arr_delay ~ manufacturer, data = flights_manufacturer)
print(kruskal_result)
##
## Kruskal-Wallis rank sum test
##
## data: arr_delay by manufacturer
## Kruskal-Wallis chi-squared = 3770.9, df = 21, p-value < 2.2e-16
if (kruskal_result$p.value < 0.05) {
dunn_output <- capture.output(
dunn_result <- dunn.test(
flights_manufacturer$arr_delay,
flights_manufacturer$manufacturer,
kw = FALSE,
list = TRUE,
rmc = FALSE,
alpha = 0.05
)
)
sig_comparisons <- data.frame(
comparison = dunn_result$comparisons,
p_value = dunn_result$P.adjusted
) %>%
filter(p_value < 0.05) %>%
arrange(p_value) %>%
slice_head(n = 5) %>% # Show only top 5
mutate(p_value = signif(p_value, 4)) # Optional: round p-values
cat("Kruskal-Wallis p-value:", signif(kruskal_result$p.value, 4), "\n")
cat("Significant differences found. Top 5 manufacturer pairs with different arrival delays:\n\n")
print(sig_comparisons, row.names = FALSE)
} else {
cat("Kruskal-Wallis p-value:", signif(kruskal_result$p.value, 4), "\n")
cat("No significant differences found among manufacturers.\n")
}
## Kruskal-Wallis p-value: 0
## Significant differences found. Top 5 manufacturer pairs with different arrival delays:
##
## comparison p_value
## AIRBUS INDUSTRIE - EMBRAER 0.000e+00
## BOEING - EMBRAER 0.000e+00
## BOMBARDIER INC - EMBRAER 0.000e+00
## AIRBUS - EMBRAER 1.240e-223
## EMBRAER - MCDONNELL DOUGLAS 2.626e-188
The Kruskal-Wallis test indicates a highly significant difference in arrival delays across plane manufacturers (\(\chi^2\) = 3770.9, df = 21, p < 2.2e-16). This strongly suggests that not all manufacturers have the same on-time performance. The post-hoc Dunn’s test with Bonferroni correction highlights specific pairwise differences. Notably, comparisons involving Embraer show consistently significant differences with Airbus Industrie, Boeing, Bombardier Inc, and McDonnell Douglas, indicating that Embraer aircraft tend to have distinct arrival delay patterns compared to these manufacturers. Additionally, Airbus and McDonnell Douglas pairs also appear among the most significant differences. These results indicate that certain manufacturers, especially Embraer, are associated with notably different arrival delay profiles, reflecting differences in punctuality or operational factors across manufacturers.
\(H_0\): Environmental factors like humidity, visibility, and wind do not affect flight delays across airports.
\(H_1\): At least one of the environmental factors like humidity, visibility, and wind affect flight delays.
To address this question, a correlation matrix was utilized to explore the relationship between departure delays and various environmental factors. Initially, the dataset was filtered to exclude any missing values related to departure delay and air time. Following this, the flight data was merged with a weather dataset subset that included key environmental factors such as humidity, visibility, wind speed, air density, and wind gust.
To prepare for further analysis and cross-validation, the merged dataset, denoted as ‘flights_weather’, was grouped by airport origin. The airports included in the analysis were Newark Liberty Airport (EWR), LaGuardia Airport (LGA), and John F. Kennedy Airport (JFK). A sample of 1,000 observations was randomly selected from each airport group using the ‘sample_n’ function. This sampling process was conducted to create balanced and manageable traing data sets for modeling and cross-validation procedures.
## year month day dep_time sched_dep_time
## 0 0 0 0 0
## dep_delay arr_time sched_arr_time arr_delay carrier
## 0 0 0 0 0
## flight tailnum origin dest air_time
## 0 0 0 0 0
## distance hour minute time_hour
## 0 0 0 0
## tibble [327,346 × 19] (S3: tbl_df/tbl/data.frame)
## $ year : int [1:327346] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
## $ month : int [1:327346] 1 1 1 1 1 1 1 1 1 1 ...
## $ day : int [1:327346] 1 1 1 1 1 1 1 1 1 1 ...
## $ dep_time : int [1:327346] 517 533 542 544 554 554 555 557 557 558 ...
## $ sched_dep_time: int [1:327346] 515 529 540 545 600 558 600 600 600 600 ...
## $ dep_delay : num [1:327346] 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
## $ arr_time : int [1:327346] 830 850 923 1004 812 740 913 709 838 753 ...
## $ sched_arr_time: int [1:327346] 819 830 850 1022 837 728 854 723 846 745 ...
## $ arr_delay : num [1:327346] 11 20 33 -18 -25 12 19 -14 -8 8 ...
## $ carrier : chr [1:327346] "UA" "UA" "AA" "B6" ...
## $ flight : int [1:327346] 1545 1714 1141 725 461 1696 507 5708 79 301 ...
## $ tailnum : chr [1:327346] "N14228" "N24211" "N619AA" "N804JB" ...
## $ origin : chr [1:327346] "EWR" "LGA" "JFK" "JFK" ...
## $ dest : chr [1:327346] "IAH" "IAH" "MIA" "BQN" ...
## $ air_time : num [1:327346] 227 227 160 183 116 150 158 53 140 138 ...
## $ distance : num [1:327346] 1400 1416 1089 1576 762 ...
## $ hour : num [1:327346] 5 5 5 5 6 5 6 6 6 6 ...
## $ minute : num [1:327346] 15 29 40 45 0 58 0 0 0 0 ...
## $ time_hour : POSIXct[1:327346], format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...
## dep_delay humid visib wind_gust wind_speed
## dep_delay 1.00000000 0.122166383 -0.09745755 0.02441787 0.021001914
## humid 0.12216638 1.000000000 -0.45001107 -0.05383642 -0.009770943
## visib -0.09745755 -0.450011070 1.00000000 -0.06001602 -0.063961212
## wind_gust 0.02441787 -0.053836421 -0.06001602 1.00000000 0.874430293
## wind_speed 0.02100191 -0.009770943 -0.06396121 0.87443029 1.000000000
## wind_dir -0.06842713 -0.411605970 0.20250473 0.14504334 0.116032873
## air_density -0.06533380 -0.100994602 -0.02646630 0.29750449 0.275989839
## wind_dir air_density
## dep_delay -0.06842713 -0.0653338
## humid -0.41160597 -0.1009946
## visib 0.20250473 -0.0264663
## wind_gust 0.14504334 0.2975045
## wind_speed 0.11603287 0.2759898
## wind_dir 1.00000000 0.2012071
## air_density 0.20120708 1.0000000
The correlation matrix examined correlation between departure delays and environmental factors. The results show environmental factors like humidity, air density, visibility, and wind direction have weak to moderate correlation with flight delays. However, wind speed and wind gust show no correlation, thus these variables omitted from the models providing us with following equation to build our models: \(\text{dep delay} = \beta_0 + \beta_1 \cdot \text{humidity} + \beta_2 \cdot \text{visibility} + \beta_3 \cdot \text{wind gust} + \beta_4 \cdot \text{wind speed} + \beta_5 \cdot \text{wind direction} + \beta_6 \cdot \text{air density} + \epsilon\).
Three linear models were built, and forward selection was employed for model selection, providing the predictive departure delay for for each airport model:
The model for Newark Liberty Airport (‘lm.ewr.01’) is given by: \(\hat{y}_{\text{EWRdep delay}} = \hat{\beta}_0 + \hat{\beta}_1 \cdot\text{humid}\).
The model for LaGuardia Airport (‘lm.lga.01’) is given by: \(\hat{y}_{\text{LGAdep delay}} = \hat{\beta}_0 + \hat{\beta}_1 \cdot\text{humid} + \hat{\beta}_2 \cdot\text{wind dir}\).
The model for John F. Kennedy Airport (‘lm.jfk’) is given by: \(\hat{y}_{\text{JFKdep delay}} = \hat{\beta}_0 + \hat{\beta}_1 \cdot\text{visib} + \hat{\beta}_2 \cdot\text{wind dir}\).
Model Fit: Following the building of our models, residual plots were used to evaluate model fit of the models and identify issues. For the Residuals vs. Fitted Plots, slight curvatures are detected, suggesting some non-linearity in the relationship between the predictors and response variable. This indicates that the linear models may not fully capture underlying patterns in the data. For Normal Q-Q Plots, the residuals deviate from the line at the tails, highlighting some deviation from normality, this could suggest the presence of outliers or a skewed distribution of residuals. For Scale-Location Plots, the spread of the residuals increases as fitted values increase, this suggests some heteroscedasticity. This suggests that variability of residuals is not constant, which can affect the validity of the regression analysis. For the Residuals vs Leverage Plots, we can see a presence of influential observations, which can affect the model’s coefficients and overall fit. All trends were similarly observered across all three models. To address these issues, summary statistics (R-squared and Mean squared error) were employed through a validation set to provide further insight into the models performance.
##
## Call:
## lm(formula = dep_delay ~ humid + air_density + visib + wind_dir,
## data = delay_EWR)
##
## Residuals:
## Min 1Q Median 3Q Max
## -60.58 -27.95 -17.59 10.19 349.65
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.495e+01 3.295e+01 1.364 0.173
## humid 4.623e-01 1.174e-01 3.939 8.77e-05 ***
## air_density -1.784e+03 2.248e+03 -0.794 0.428
## visib -1.282e+00 1.711e+00 -0.750 0.454
## wind_dir 6.920e-03 2.222e-02 0.311 0.755
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 48.69 on 995 degrees of freedom
## Multiple R-squared: 0.02733, Adjusted R-squared: 0.02342
## F-statistic: 6.989 on 4 and 995 DF, p-value: 1.505e-05
##
## Call:
## lm(formula = dep_delay ~ humid + air_density + visib + wind_dir,
## data = delay_LGA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -62.68 -31.50 -18.62 9.05 619.94
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.013e+02 3.853e+01 2.629 0.008704 **
## humid 6.401e-01 1.422e-01 4.502 7.53e-06 ***
## air_density -1.009e+04 2.738e+03 -3.685 0.000241 ***
## visib 9.158e-01 1.843e+00 0.497 0.619289
## wind_dir 3.619e-02 2.520e-02 1.436 0.151236
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 57.88 on 995 degrees of freedom
## Multiple R-squared: 0.03822, Adjusted R-squared: 0.03435
## F-statistic: 9.884 on 4 and 995 DF, p-value: 7.624e-08
##
## Call:
## lm(formula = dep_delay ~ humid + air_density + visib + wind_dir,
## data = delay_JFK)
##
## Residuals:
## Min 1Q Median 3Q Max
## -59.13 -27.94 -17.07 12.84 254.67
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.918e+01 3.095e+01 3.204 0.00140 **
## humid 1.115e-01 9.131e-02 1.221 0.22235
## air_density -3.392e+03 2.475e+03 -1.371 0.17080
## visib -2.663e+00 8.655e-01 -3.077 0.00215 **
## wind_dir -2.352e-02 1.819e-02 -1.293 0.19642
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 44.91 on 995 degrees of freedom
## Multiple R-squared: 0.03257, Adjusted R-squared: 0.02868
## F-statistic: 8.375 on 4 and 995 DF, p-value: 1.204e-06
## Start: AIC=7795.78
## dep_delay ~ 1
##
## Df Sum of Sq RSS AIC
## + humid 1 63728 2361733 7771.2
## + visib 1 22817 2402643 7788.3
## + wind_dir 1 10156 2415304 7793.6
## <none> 2425460 7795.8
## + air_density 1 2576 2422884 7796.7
##
## Step: AIC=7771.15
## dep_delay ~ humid
##
## Df Sum of Sq RSS AIC
## <none> 2361733 7771.2
## + air_density 1 1059.77 2360673 7772.7
## + visib 1 990.11 2360742 7772.7
## + wind_dir 1 48.39 2361684 7773.1
## Start: AIC=8152.86
## dep_delay ~ 1
##
## Df Sum of Sq RSS AIC
## + humid 1 83311 3383040 8130.5
## + air_density 1 55526 3410825 8138.7
## + wind_dir 1 9139 3457212 8152.2
## <none> 3466350 8152.9
## + visib 1 6004 3460347 8153.1
##
## Step: AIC=8130.53
## dep_delay ~ humid
##
## Df Sum of Sq RSS AIC
## + air_density 1 41448 3341591 8120.2
## <none> 3383040 8130.5
## + visib 1 3180 3379860 8131.6
## + wind_dir 1 364 3382676 8132.4
##
## Step: AIC=8120.2
## dep_delay ~ humid + air_density
##
## Df Sum of Sq RSS AIC
## + wind_dir 1 6886.8 3334705 8120.1
## <none> 3341591 8120.2
## + visib 1 802.4 3340789 8122.0
##
## Step: AIC=8120.14
## dep_delay ~ humid + air_density + wind_dir
##
## Df Sum of Sq RSS AIC
## <none> 3334705 8120.1
## + visib 1 827.68 3333877 8121.9
## Start: AIC=7639.4
## dep_delay ~ 1
##
## Df Sum of Sq RSS AIC
## + visib 1 47763 2026588 7618.1
## + humid 1 42044 2032307 7620.9
## + wind_dir 1 24074 2050276 7629.7
## + air_density 1 10190 2064161 7636.5
## <none> 2074351 7639.4
##
## Step: AIC=7618.11
## dep_delay ~ visib
##
## Df Sum of Sq RSS AIC
## + wind_dir 1 11643.7 2014944 7614.3
## + humid 1 11141.9 2015446 7614.6
## + air_density 1 9500.3 2017088 7615.4
## <none> 2026588 7618.1
##
## Step: AIC=7614.35
## dep_delay ~ visib + wind_dir
##
## Df Sum of Sq RSS AIC
## + air_density 1 5153.9 2009791 7613.8
## + humid 1 4372.2 2010572 7614.2
## <none> 2014944 7614.3
##
## Step: AIC=7613.79
## dep_delay ~ visib + wind_dir + air_density
##
## Df Sum of Sq RSS AIC
## <none> 2009791 7613.8
## + humid 1 3007.2 2006783 7614.3
##
## Call:
## lm(formula = dep_delay ~ humid, data = delay_EWR)
##
## Residuals:
## Min 1Q Median 3Q Max
## -56.24 -28.07 -17.25 10.17 351.56
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.44087 4.66697 2.880 0.00406 **
## humid 0.49434 0.09526 5.189 2.56e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 48.65 on 998 degrees of freedom
## Multiple R-squared: 0.02627, Adjusted R-squared: 0.0253
## F-statistic: 26.93 on 1 and 998 DF, p-value: 2.556e-07
##
## Call:
## lm(formula = dep_delay ~ humid + wind_dir, data = delay_LGA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -60.88 -31.58 -19.30 7.64 615.69
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.581226 10.360302 0.925 0.355
## humid 0.605232 0.129128 4.687 3.16e-06 ***
## wind_dir 0.007937 0.024221 0.328 0.743
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 58.25 on 997 degrees of freedom
## Multiple R-squared: 0.02414, Adjusted R-squared: 0.02218
## F-statistic: 12.33 on 2 and 997 DF, p-value: 5.127e-06
##
## Call:
## lm(formula = dep_delay ~ visib + wind_dir, data = delay_JFK)
##
## Residuals:
## Min 1Q Median 3Q Max
## -65.68 -28.03 -17.49 12.66 253.51
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 75.93543 7.47569 10.158 < 2e-16 ***
## visib -3.13529 0.74986 -4.181 3.15e-05 ***
## wind_dir -0.03901 0.01625 -2.400 0.0166 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 44.96 on 997 degrees of freedom
## Multiple R-squared: 0.02864, Adjusted R-squared: 0.02669
## F-statistic: 14.7 on 2 and 997 DF, p-value: 5.121e-07
Validation Fit To evaluate predictive performance of the models, predictions were assesed using a validation set, with metrics such as Mean Squared Error (MSE) and adjusted R-squared values being calculated for both the training and test sets. Lower MSE values indicate better model performance. In this analysis, the models ‘lm.ewr01’, ‘lm.lga01’, and ‘lm.fjk01’ all exhibited lower test MSE compared to the full models. However, all models demonstrated low adjusted R-squared values, indicating potential issues with model fit or data quality.
## Warning in y - yhat: longer object length is not a multiple of shorter object
## length
## Warning in y - yhat: longer object length is not a multiple of shorter object
## length
## Warning in y - yhat: longer object length is not a multiple of shorter object
## length
## Warning in y - yhat: longer object length is not a multiple of shorter object
## length
## Warning in y - yhat: longer object length is not a multiple of shorter object
## length
## Warning in y - yhat: longer object length is not a multiple of shorter object
## length
Validation Fit Summary When examining the variance explained by the models, it was found that approximately 2.35% of the variance in flight delays at Newark Liberty Airport is explained by humidity. For LaGuardia Airport, about 2.10% of the variance is explained by humidity and wind direction. For John F. Kennedy Airport, approximately 3.65% of the variance is explained by visibility and wind direction. All models indicate additional variables might be influenced by other factors.
Overall, while the models provide some predictive capability the low adjusted R-squared values suggest that further adjustments and inclusion of additional predictors could enhance the models performance.
Our findings indicate that environmental factors such as humidity, visibility, and wind direction do have significant effects on flight delays. However, these effects differ across airports. Additionally, The models for each airport showed varying values with adjust R-squared values indicating the proportion of variance in flight delays explained by the models. The residual error provide information concerning the accuracy of model predictions.
| Airport | Model | Significant.Predictors | Intercept | Predictor.Estimates | R.squared | Adjusted.R.squared | F.statistic | Residual.Standard.Error |
|---|---|---|---|---|---|---|---|---|
| EWR | lm.ewr.01 | humid | 15.5746 (p < 0.01) | humid: 0.5035 (p < 0.001) | 0.02452 | 0.02354 | 25.09 (p < 0.001) | 50.57 |
| JFK | lm.jfk.01 | humid, wind_dir | 80.38946 (p < 0.001) | visib: -2.87115 (p < 0.001), wind_dir: -0.07205 (p < 0.001) | 0.03840 | 0.03648 | 19.91 (p < 0.001) | 43.63 |
| LGA | lm.lga.01 | visib, wind_dir | 18.1891 (p = 0.05) | humid: 0.49128 (p < 0.001), wind_dir: -0.01727 (p = 0.42) | 0.02302 | 0.02106 | 11.74 (p < 0.001) | 52.45 |
At Newark Liberty Airport, humidity was identified as a significant predictor of departure delays, with a p-value of less than 0.05. For LaGuardia Airport, both humidity and wind direction were significant predictors, also with p-values less than 0.05. At John F. Kennedy Airport, visibility and wind direction emerged as significant predictors of flight delays, again with p-values less than 0.05. Based on the analysis, we reject the null hypothesis. As the evidence suggests that at least one of the environmental factors significantly affects flight delays across airports.
Flight delays are influenced by multiple interrelated factors including weather, aircraft characteristics, holidays, and airport-specific conditions. This analysis provides valuable insights that can help stakeholders identify key areas for improving on-time performance and reduce delays at airports.
As a backup idea, we are planning on seeing if there is any correlation between the amount of delays present in the different airports. Our data deals with the airports EWR, JFK, and LGA which are all different airports within New York City. Our first question is to figure out if the JFK airport has a different amount of delays compared to LGA or EWR if there is a higher amount of precipitation in the JFK area. Although all the airports are in New York, within the different areas of the city, there can be different amounts of precipitation and rainfall that occur. Our second question is to decide whether the different airports have different models of planes and if the difference affects the amounts of delays. For example if a plane is older or a different configuration, does that lead to more delays due to cleaning or maintenance? And lastly, our third question is whether the three different airports have different airlines coming in and out and if these differing airlines affect the amount of delays present on a given day. For example, if Delta services one airport and not another, does that increase or decrease the amount of total delays for an airport. These questions can be further investigated if our first set of questions are not approved or if we need more content to explore within our project. These sets of backup questions will further explore the flight data we have.